Supplementary Information: Sparse Representations of High Dimensional Neural Data

Section 1 provides supplementary information for the introductory section of the main manuscript. Section 2.1 provides a quick review of standard VAR modeling. Section 2.3 reviews the log likelihood ratio for comparing nested VAR models for testing the significance of Granger Causality, discusses the SVARGS algorithm in a manner applicable to blocks, and has further comments on the algorithm such as the choice of the p_entry value. Section 3 describes all the quantities that can be computed directly from the VAR model without any further data analysis. Section 4 includes additional graphs showing performance measures for our methods. Section 5 outlines efficient methods for computing the Autocovariance and the conditional GC matrix. Section 6 discusses the efficient implementation of the SVARGS.



Neuronal Networks

CoCoMac is a coordinate-independent collection of annotations compiled from "over 2,500 anatomical tracer injections" and from "over 400 published experimental studies" that captures 10,681 anatomical connectivity relations and 16,712 overlap (or mapping) relations between pairs of annotated brain regions. Based on these relations, Modha et al construct a network hierarchy consisting of 383 brain regions and 6,602 directed edges spanning the entire Macaque brain. The undirected version of the network consists of 5,208 edges, which gives an average (undirected) degree of about 27.


VAR modeling

2.1 Standard method using Yule-Walker equations

We briefly summarize the standard method of estimating the VAR coefficients. The maximum likelihood method is used to derive estimates for the coefficients A_i. Suppose that the time series data Ŷ, composed of k variables and N observations at regular time intervals, can be modeled by Main [2]. Denoting the observed values of the noise ε_t by e_t, the log-likelihood l of the observed noise (see [1, pg 346, section 5.4.1]) is given by:

l(e_{p+1}, \ldots, e_N, \Sigma) = \frac{N-p}{2}\,\log\frac{|\Sigma^{-1}|}{(2\pi)^k} - \sum_{t=p+1}^{N} \frac{1}{2}\, e_t^T \Sigma^{-1} e_t    (2.1.1)
Maximizing (2.1.1) is clearly equivalent to minimizing the sum of squares S:

S(\{A_i\}) = \sum_{t=p+1}^{N} e_t^T \Sigma^{-1} e_t    (2.1.2)
The invertible linear relationship between the variables {ε_{p+1}, ..., ε_N} and {Y_{p+1}, ..., Y_N} implies that there is a one-to-one correspondence between the observed values Ŷ of Y and the observed values e of ε, and also that the Jacobian of the transformation is constant, so that minimizing S is equivalent to maximizing the log-likelihood function of Ŷ. Setting the derivatives of S with respect to the coefficients A_i to zero gives the normal equations:

\sum_{t=p+1}^{N} e_t\, Y_{t-i}^T = 0, \qquad i = 1, 2, \ldots, p    (2.1.3)

which can be written in block matrix form as:

\begin{pmatrix} G_0 & G_1 & \cdots & G_{p-1} \\ G_{-1} & G_0 & \cdots & G_{p-2} \\ \vdots & & \ddots & \vdots \\ G_{-p+1} & G_{-p+2} & \cdots & G_0 \end{pmatrix} \begin{pmatrix} A_1^T \\ A_2^T \\ \vdots \\ A_p^T \end{pmatrix} = \begin{pmatrix} G_1^T \\ G_2^T \\ \vdots \\ G_p^T \end{pmatrix}    (2.1.4)

where:

G_i = \frac{1}{N} \sum_{t=p+1}^{N} Y_t Y_{t-i}^T, \qquad G_{-i} = G_i^T    (2.1.5)

The k × k matrices G_i are the usual estimates for the auto-covariance function of the process Y_t. The equations (2.1.4) are the Yule-Walker equations using estimated autocovariances.
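For concreteness, the following sketch (our own illustration, not the implementation used for the results in this paper) shows one way of solving (2.1.4)-(2.1.5) numerically; the function name and the numpy-based layout are ours, and the data are assumed to be mean-centered.

```python
import numpy as np

def yule_walker_var(Y, p):
    """Estimate VAR(p) coefficient matrices A_1..A_p from mean-centered data
    Y of shape (k, N) by solving the Yule-Walker equations (2.1.4) with the
    autocovariance estimates G_i of (2.1.5)."""
    k, N = Y.shape
    # G_i = (1/N) * sum_t Y_t Y_{t-i}^T  (biased estimates), i = 0..p
    G = [(Y[:, i:] @ Y[:, :N - i].T) / N for i in range(p + 1)]
    G_at = lambda m: G[m] if m >= 0 else G[-m].T          # G_{-m} = G_m^T
    # Block Toeplitz system of (2.1.4): block (i, j) is G_{j-i}
    lhs = np.block([[G_at(j - i) for j in range(p)] for i in range(p)])
    rhs = np.vstack([G_at(i + 1).T for i in range(p)])
    At = np.linalg.solve(lhs, rhs)                         # stacked A_1^T .. A_p^T
    return [At[i * k:(i + 1) * k, :].T for i in range(p)]
```

The returned list contains A_1, ..., A_p; the residual covariance can then be estimated as in (2.3.2).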

The solution obtained by (2.1.4) above is often not satisfactory, since for large systems with only moderate amounts of data, the amount of noise in the coefficients makes identification of the actual non-zero coefficients difficult. Moreover, variations on this method based on identification of the best model order using information criteria such as the Akaike Information Criterion (AIC) or the Schwartz Bayes Criterion (SBC) all lead to the same issue of noise in the coefficients.

For the procedures to be described, we need to use regression with the lagged values of Ŷ as predictors. Below, we show that this kind of regression is valid and is a generalization of the Yule-Walker equations.


Coefficients with constraints via regression

The coefficients A_i can also be obtained by linear Least-Squares regression on the data with the lagged values of Ŷ as predictors. More precisely, if y^{(r)}, r = 1, 2, ..., k, are the individual components of Y, set:

x^{(r)}_i = y^{(r)}_{t-i}    (2.2.1)

X = \bigcup_{r=1}^{k} \{x^{(r)}_1, x^{(r)}_2, \ldots, x^{(r)}_p\}    (2.2.2)

If, for each r = 1, 2, ..., k, we perform a linear Least-Squares regression of the observed values ŷ^{(r)} of y^{(r)} on the observed values X̂ of X, the coefficients obtained are identical to those from (2.1.4). This fact is easy to see from the standard formula for linear regression, X̂^T X̂ β_r = X̂^T ŷ^{(r)}, where β_r = [a_{r1}, a_{r2}, ..., a_{rp}]^T is the column vector of coefficients obtained by concatenating the r-th rows of the matrix coefficients A_1, A_2, ..., A_p. Putting the k equations above together, we get:

X̂^T X̂\, [\beta_1, \beta_2, \ldots, \beta_k] = X̂^T Ŷ    (2.2.3)

which on inspection will be seen to be identical to (2.1.4). Now suppose that the coefficients satisfy a set of linear constraints:

C\,[a_{qrs}] = c    (2.2.4)

Here a_{qrs} is the element in the r-th row and s-th column of A_q, C is a matrix with k^2 p columns and l rows, and c is a column vector of length l < k^2 p, where l is the number of constraints. The column vector [a_{qrs}], of length k^2 p, is formed by running over all indices q, r, s. Equation (2.2.4) represents a general linear relation between the set of all coefficients. For such constrained coefficients, we can use the Lagrange Multiplier Method to minimize (2.1.2) taken along with (2.2.4).

The question arises whether we can continue to estimate the coefficients A_i one equation at a time as before; the answer is yes if the constraints are of a special form. In general, coefficient estimates obtained by using the Lagrange multiplier method on (2.1.2) and (2.2.4) are not the same as those obtained by performing linear Least-Squares regression on the individual equations. This is because Least-Squares regression on individual components will fail to account for inter-equation constraints. However, in the special case when (2.2.4) consists only of intra-equation constraints of the form:
C_r\,[\hat a_{r1}, \hat a_{r2}, \ldots, \hat a_{rp}]^T = c_r, \qquad r = 1, 2, \ldots, k    (2.2.5)

where each C_r is a matrix with kp columns and l_r < kp rows, a_{rq} is the r-th row of the matrix A_q, and each c_r is a vector of length l_r, the Lagrange multiplier equations simplify to give the individual regression equations as before. In particular, subset constraints of the form:
a_{qrs} = 0    (2.2.6)

for particular sets of indices {q, r, s} are intra-equation constraints of this kind, and imposing them is equivalent to performing a separate linear regression for each equation in which only those regressors in the set X (see (2.2.2)) that have non-zero coefficients need to be considered.
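A minimal sketch of such a per-equation subset regression (illustrative only; the function name and data layout are ours):

```python
import numpy as np

def fit_equation_subset(Y, r, allowed, p):
    """Least-squares fit of the r-th equation of a VAR(p) using only the
    regressors whose (lag q, source s) locations are listed in `allowed`;
    all other coefficients a_{q r s} are constrained to zero as in (2.2.6)."""
    k, N = Y.shape
    X = np.column_stack([Y[s, p - q:N - q] for (q, s) in allowed])  # lagged regressors
    y = Y[r, p:]                                                    # response y^{(r)}_t, t >= p
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(allowed, coefs))
```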

The value of the maximum likelihood of the observed data can now be found by substituting these estimates into equation (2.1.1). The value of the log likelihood evaluated at Â_i is:

l(\Sigma, \{\hat A_i\}) = \frac{N-p}{2}\,\log\frac{|\Sigma^{-1}|}{(2\pi)^k} - \sum_{t=p+1}^{N} \frac{1}{2}\, e_t^T \Sigma^{-1} e_t    (2.3.1)

where e_t = \hat Y_t - \sum_{i=1}^{p} \hat A_i \hat Y_{t-i}. Now in practice, we do not know the value of the theoretical error covariance Σ in advance, so we have to treat it as a parameter to be estimated. We calculate its maximum likelihood estimate from the expression for the log-likelihood. Equations (2.1.3) already maximize the likelihood with respect to {A_i}, and moreover the estimate for {A_i} is independent of Σ. We therefore only need to maximize (2.3.1) with respect to Σ^{-1}, by differentiating and equating to 0. To do this we calculate the derivative with respect to the ij-th component σ^{ij} of Σ^{-1} and use the symmetry of Σ. We use the fact that for any matrix U, the differential of the determinant |U| with respect to its ij-th entry u_{ij} is (-1)^{i+j} M_{ij}\, du_{ij}, where M_{ij} is the minor of u_{ij}. Also, the differential of e_t^T U e_t with respect to u_{ij} is just (e_t e_t^T)_{ij}\, du_{ij}. We then get:
\frac{\partial\, l(\Sigma, \{A_i\})}{\partial \Sigma^{-1}} = \frac{N-p}{2}\,\Sigma - \sum_{t=p+1}^{N} \frac{1}{2}\, e_t e_t^T = 0

and hence the estimate for Σ which maximizes the likelihood is:

\hat\Sigma = \frac{1}{N-p} \sum_{t=p+1}^{N} e_t e_t^T \;[\,= \Sigma_e\,]    (2.3.2)
Substituting this in (2.3.1), we obtain for the maximum likelihood:

l(\{\hat A_i\}) = \frac{N-p}{2}\,\log\frac{|\Sigma_e^{-1}|}{(2\pi)^k\, e^k}    (2.3.3)

= -\frac{N-p}{2}\,\bigl[\log(|\Sigma_e|) + k(1 + \log(2\pi))\bigr]    (2.3.4)

Time Series with multiple trials

To fit multitrial data with r trials, interleave the trials by placing, for every channel i and time point j, all the trial values at (i, j) adjacent to each other in the same row, and use regressors that are lagged by multiples of r only. For each equation, this procedure can easily be seen to be theoretically equivalent to fitting the coefficients by inverting the average, over all realizations, of the regressor covariance matrices.


Likelihood ratio for nested models

Suppose we have two nested VAR models M_0 and M_1 for a data set given by {y_1, y_2, ..., y_N}. Let the models correspond to parameter sets A^{(0)} and A^{(1)} respectively. The Likelihood Ratio:
\Lambda(A^{(1)} : A^{(0)}) = \frac{L(\Sigma, A^{(1)})}{L(\Sigma, A^{(0)})}    (2.3.5)
is a measure of the superiority of the model M_1 over the model M_0. If Λ(A^{(1)} : A^{(0)}) > 1, we can consider model M_1 to be better than model M_0. Hence we need to determine if the statistic Λ(A^{(1)} : A^{(0)}) is significantly larger than unity. For this we need its probability distribution, which, in general, can be difficult or complicated to find. However, in the case when Λ(A^{(1)} : A^{(0)}) is a ratio of maximum likelihoods, the following result, known as Wilks' theorem ([2]), enables us to determine the asymptotic distribution. Suppose the parameter θ ranges over an open set Ω ⊂ R^m, and that under the null hypothesis θ is restricted to Ω_0 ⊂ Ω, where Ω_0 is open in R^{m−λ}, λ > 0.
Assume that the maximum likelihood estimates of θ have asymptotically (N → ∞) normal distributions. Let H_0 : θ = θ_0 be the null hypothesis. Then under H_0, the log likelihood ratio:

r_\lambda = 2\,\log(\Lambda(\theta : \theta_0))

is asymptotically distributed according to the χ²(λ) distribution, i.e.:

\lim_{N \to \infty} P(r_\lambda < r) = F_{\chi^2(\lambda)}(r)

where F_{\chi^2(\lambda)} is the CDF of the χ²(λ) distribution.


Determine significance of estimated GC value

The concept of Granger Causality can be extended to use it in a lagwise manner instead of just a variablewise manner. Given the time series Y as above, suppose we have some existing model M_0 for it. Given the existing coefficients of the model, consider the Likelihood ratio restricted to a block of components Y_{to} = {Y^{(j_1)}, Y^{(j_2)}, ..., Y^{(j_s)}} after entry of another disjoint block of components Y_{from} = {Y^{(i_1)}, Y^{(i_2)}, ..., Y^{(i_r)}} at some lag q. From (2.3.3), this is given by:
R = (N - p_0)\,\log(|\Sigma_{to}|) - (N - p)\,\log(|\tilde\Sigma_{to}|) + 2k(p - p_0)(1 + \log(2\pi))    (2.3.6)

where Σ_{to} is the error covariance of the to block in the original model M_0, and \tilde\Sigma_{to} is the error covariance of the to block after adding the coefficient locations of the from block at lag q and refitting the model. p_0 and p are the orders of the model (restricted to the to block) before and after adding the new coefficients. The third term in equation (2.3.6) is a penalty term somewhat analogous to the penalty term found in information criteria such as AIC. In our case, we fix a maximum possible lag p_max beforehand and use only data starting from p_max + 1. This also has the advantage that genuine initial data is available from the data up to lag p.

(2.3.6) then reduces to:
R = (N - p_{max})\,\log\frac{|\Sigma_{to}|}{|\tilde\Sigma_{to}|}    (2.3.7)

= (N - p_{max})\, gc_{from \to to}    (2.3.8)

Because the coefficients are found using linear Least-Squares regression, (2.3.7) is a ratio of maximum likelihoods, and hence it follows from Wilks' theorem above that (N - p_{max})\, gc_{from \to to} has asymptotically a χ²(λ) distribution, where λ is the number of components in the from block. We can therefore treat it in exactly the same way as the Likelihood ratio. If the blocks are atomic (that is, have only one variable), then this Granger Causality statistic is closely related to the F-statistic used for coefficient entry in stepwise linear regression. The F-statistic has an F(1, n − λ_0 − 1) distribution, where λ_0 is the number of coefficients in the model M_0. For large n, this distribution approximates the χ²(1) distribution. The results in section Main [3] show that the GC statistic above is an effective way to determine all the significant coefficients at different lags.
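As an illustration of how (2.3.7)-(2.3.8) can be evaluated in practice, the following hedged sketch computes the statistic and refers it to the χ² distribution; the function name is ours, and the degrees-of-freedom argument simply follows the description above.

```python
import numpy as np
from scipy import stats

def gc_entry_p_value(Sigma_to, Sigma_to_new, N, p_max, dof):
    """Significance of a candidate block entry: the statistic
    R = (N - p_max) * log(|Sigma_to| / |Sigma_to_new|) of (2.3.7)-(2.3.8)
    is referred to a chi-squared distribution with `dof` degrees of freedom
    (the number of components in the 'from' block, per the text above)."""
    gc = np.linalg.slogdet(Sigma_to)[1] - np.linalg.slogdet(Sigma_to_new)[1]
    R = (N - p_max) * gc
    return R, stats.chi2.sf(R, dof)
```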


Confidence intervals for coefficients

The finite number of time points implies some uncertainty in the size of the coefficients obtained. In addition, this uncertainty places a threshold on the size of the smallest coefficients that can be detected by the algorithm. In order to find the uncertainty in the computed coefficients, first note that for a random variable ζ that is the normalized sum of squares of identically normally distributed random variables χ_i,

\zeta = \frac{1}{N} \sum_i \chi_i^2    (2.3.9)

the standard deviation s_ζ of ζ is \frac{1}{\sqrt N}\, s_\chi, where s_χ is the common standard deviation of the variables χ_i. Moreover, ζ is asymptotically normally distributed as N → ∞. In our case, where N is at least 100 or more, we can take ζ to be approximately normal. Recall from section Supp [6], equation Supp [(??)], that when a new coefficient c is added at some stage of the fit, the following equation holds:
\Delta\sigma = c^2\, \bar x^T (I - H_0)\, \bar x
Here \bar x is the new regressor associated with the location of the coefficient c, Δσ is the change in the sum of squares of residuals due to the addition of the new coefficient, and H_0 is given by:
H_0 = Q_0 Q_0^T

where Q_0 is an orthonormal basis for the column space of the previously entered regressors X_0, before the new regressor was added. Now we have:

\frac{1}{N}\,\Delta\sigma = \frac{1}{N} \sum_i (e_i^2 - e_{0i}^2)

where e_i and e_{0i} are the components at time i of the residuals \bar e and \bar e_0, before and after the entry of c, respectively. So we can write:

c^2 = \frac{\sum_i e_i^2}{\bar x^T (I - H_0)\bar x} - \frac{\sum_i e_{0i}^2}{\bar x^T (I - H_0)\bar x}    (2.3.10)
Now note that the response variable \bar y and the lagged regressors X are jointly normally distributed. Hence the residuals \bar e in the first term, which are a linear combination of the response variable \bar y, the regressors in X and the newly added regressor \bar x, are jointly normally distributed. In addition, the residual vector \bar e is orthogonal to \bar x, which implies that \bar x and \bar e are independent random variables. Now it can be shown (see for example [3, Ch. 9, section 4]) that if M is an idempotent matrix, then \bar x^T M \bar x has a χ² distribution with degrees of freedom equal to the rank of M. Since H_0 has rank r, the number of regressors entered so far, it follows that the denominator \bar x^T (I - H_0)\bar x of (2.3.10) has a χ²(N − r) distribution. (Note, incidentally, that by our assumption of sparsity, r is small compared to N.) In addition, the numerator \sum_i e_i^2 has, up to scale, a χ² distribution with approximately N degrees of freedom. In view of the independence of \bar e and \bar x, the numerator and denominator are also independent (H_0 is also formed from lagged regressors), and hence the first term in (2.3.10) has an F(N, N − r) distribution with variance O(1/N²). For the second term we do not have independence of the numerator and denominator, and hence we simply treat it as a ratio of two χ²-distributed quantities. The variances of the numerator and denominator are 2σ_e/N and 2σ_x/(N − r) ≈ 2σ_x/N.
Taking the square root of (2.3.10) and using the above approximations, the uncertainty in the estimation of c is given by:

\sigma_c \approx 2\sqrt{\frac{2}{N}}\, s_e s_x + O\!\left(\frac{1}{N}\right)    (2.3.11)

\approx 2\sqrt{\frac{2}{N}}\, s_e s_x    (2.3.12)

where s_e and s_x are the common standard deviations, respectively, of the e_i and x_i.


Entry p-Value for SVARGS

We provide a rationale for our choice of p-values for coefficient entry in the SVARGS algorithm. The argument is heuristic and we relate it to the concept of the False Discovery Rate (FDR). First note that at each coefficient entry step, we test k²q coefficients (over all equations). For any particular coefficient whose true value is 0, the probability of a false entry at that step is p_entry. Now, if a coefficient was falsely entered, but the true value of its likelihood statistic is unity, then the probability that it will remain at the removal step is p_exit. For any such coefficient (whose true value is 0), let \bar p_n be the probability that it is in the entered state at the n-th step. Then we have the following recursion:

\bar p_n = p_{exit}\, \bar p_{n-1} + p_{entry}\,(1 - \bar p_{n-1})    (2.3.13)
This can be rewritten as:

\bar p_n = a\, \bar p_{n-1} + b

where a = p_exit − p_entry and b = p_entry. Solving the recursion up to the n-th step gives:

\bar p_n = \frac{1 - a^{n-1}}{1 - a}\, b

Since a is small, this becomes:

\bar p_n \approx p_{entry}    (2.3.14)
which is what one might naively assume without going through the above calculation. In other words, our naive intuition that p_entry is the probability of a zero coefficient being falsely detected is justified. The expected number of false entries E[\tilde n] is therefore:

E[\tilde n] = k^2 q \times p_{entry}

We would like E[\tilde n] to be small compared to the final number of coefficients M. How small is decided by the choice of the False Discovery Rate, FDR. In terms of the FDR, the number of false detections is FDR × M. Equating this to E[\tilde n] above, we get:

k^2 q \times p_{entry} = FDR \times M
and hence:
p_{entry} = \frac{FDR \times M}{k^2 q}    (2.3.15)
We may choose the FDR based on the expected number of non-zero coefficients in the final model (for example, if we know the approximate sparsity beforehand, one might want to take FDR = O(1)/M).
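A small illustrative helper (names and example numbers are ours, not taken from the paper) that applies (2.3.15):

```python
def entry_p_value(k, q, M, fdr):
    """p_entry suggested by (2.3.15): k^2 * q coefficient locations are tested
    per entry step, and we allow about FDR * M false entries relative to the
    expected M non-zero coefficients of the final model."""
    return fdr * M / (k ** 2 * q)

# Illustrative numbers: k = 100 channels, q = 10 lags, an expected M ~ 500
# non-zero coefficients and FDR = 0.01 give p_entry = 5e-5.
p_entry = entry_p_value(k=100, q=10, M=500, fdr=0.01)
```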

The approach we take, given the heuristic nature of the above argument, is to treat (2.3.15) as just indicative of the best p_entry value. In our algorithm, we choose a range of values around this value. At the start of the algorithm we use the lowest value in the range and then progressively increase it (as coefficient addition gets exhausted) up to the maximum value in the range (we can also think of this as testing a range of FDR values instead of a range of p_entry values). The best value is then chosen using the extended SBC information criterion. The additional computational cost of this approach is very moderate because we are not refitting the model for every value of p_entry.

A somewhat more time-intensive alternative is to compute the model for a suitable range of values of p_entry and then use some form of cross-validation to determine the best value.


Extended SBC criterion

In our experiments with different information criteria for model selection for the SVARGS algorithm, we found that both the AIC as well as the SBC criteria were too liberal in their inclusion of coefficients. The huge number of spurious coefficients also resulted in an unacceptable slowing down of the algorithm. The Extended SBC criterion ([4]) is based on somewhat different assumptions about the model. The usual SBC criterion is derived on the basis of the prior probability of each coefficient subset being equal. For large models, this implies that larger models are assigned a higher prior probability (since the number of different subsets for a given number of coefficients increases exponentially with the number of coefficients). In [4], the extended criterion is instead based on assigning equal prior probabilities to different model sizes rather than to different coefficient subsets. The resulting criterion is:
SBC_X(m) = SBC(m) + 2\gamma\,\log(\tau(M, m)), \qquad 0 \le \gamma \le 1    (2.3.16)
Here SBC is the usual Schwartz-Bayes information criterion (= −2(log-likelihood) + m log(N)).

The second term is the adjustment for the different assumed prior. m is the complexity of the model, and the quantity τ(M, m) is the number of different subsets of coefficients of size m that can be constructed from a pool of coefficients of size M.

γ is a number between 0 and 1 and must be chosen appropriately depending on whether κ = log(M)/log(N) is approximately 1 or more (γ = 1), somewhat less than 1 (γ = 0.5), or well below 1 (γ = 0). We used a smooth approximation to the step function passing through these points.
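A sketch of how (2.3.16) might be evaluated (helper names are ours; the smooth interpolation of γ shown is only one possible choice, not necessarily the one we used):

```python
import numpy as np
from scipy.special import gammaln

def extended_sbc(loglik, m, M, N, gamma):
    """Extended SBC of (2.3.16): SBC(m) + 2*gamma*log(tau(M, m)), where
    tau(M, m) = C(M, m) is the number of coefficient subsets of size m drawn
    from a pool of size M; log C(M, m) is computed via gammaln."""
    sbc = -2.0 * loglik + m * np.log(N)
    log_tau = gammaln(M + 1) - gammaln(m + 1) - gammaln(M - m + 1)
    return sbc + 2.0 * gamma * log_tau

def gamma_of_kappa(kappa):
    """One possible smooth interpolation of the recommended values
    (gamma ~ 0 for kappa well below 1, ~ 0.5 somewhat below 1, ~ 1 near or
    above 1); the exact smoothing used in our experiments may differ."""
    return 1.0 / (1.0 + np.exp(-12.0 * (kappa - 0.8)))
```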

The criterion is shown to be consistent, satisfying the condition that the probability of selecting a model other than the true model approaches zero as the number of data points approaches ∞.


SVARGS coefficient confidence and size threshold

The finite number of time points implies some uncertainty in the size of the coefficients obtained. In addition, this uncertainty places a threshold on the size of the smallest coefficients that can be detected by the algorithm. In section 2.3.5 we determined the uncertainty σ_c in the estimated size of the coefficients. Coefficients not much larger than σ_c have a fair chance of not being detected, or of being eliminated, due to the natural variance of the finite sample of size N. Coefficients whose size is more than twice σ_c are almost certain to be detected. Based on σ_c, a rule of thumb for the threshold on the size of detectable coefficients can be taken to be:
z_c = 4\sqrt{\frac{2}{N}}\, s_e s_x    (2.3.17)
where s_e is the standard deviation of the residual of the equation containing the coefficient c, and s_x is the standard deviation of the regressor corresponding to c. In general, the greater the explanatory power of the regressors, the smaller the variance of the residual sum of squares and the better the detectability of the coefficients.


General comments on SVARGS

Note that the SVARGS fit need not necessarily give the best subset of predictors in any well-defined sense. This is a well-known drawback of stepwise regression in general. There may be approximate redundancies among the lagged predictors, so that a best subset of predictors may not even exist in a strictly mathematical sense. At best, one can say that it gives coefficient locations at which the (penalized) likelihood has a local maximum. Also note that the p-values obtained as the fit progresses are only used for model comparison in order to determine the next coefficient location to add or remove. These p-values are entirely dependent on the path taken by the fitting process and only the relative sizes of consecutive p-values are relevant for the progress of the fit. Thus these p-values are not used in any way to rank the final set of model coefficients.

For the intended applications of this method, this is perfectly alright since we are mainly interested in the question: "What is the (relatively) small set of significant interactions, delayed or otherwise, between variables?" A rough idea of the importance of the interactions can be gleaned from the (conditional) Granger Causality strengths. Moreover if, for some reason, it is necessary to rank the coefficients, one can compute p-values from the confidence intervals (computed in section 2.3.5) for the final set of coefficients.

In view of the above, we also observe that the final model obtained was fairly robust with respect to a range of FDR values. A higher FDR led to more spurious coefficients, but the prediction rate did not vary much as long as the FDR was not too large. This was found to be true both for data generated from sparse models as well as for real data. In our opinion, this indicates that in our applications (and, we suspect, often in many other real world applications), there is a wide gulf between those interactions that are significant and those that are not, with very little in the grey area. Note, however, that a higher FDR increases the computation time by increasing the coefficient density as well as the reversal ratio, the ratio of the number of backward steps to the total number of non-zero coefficients (see the supporting information of [5] under the section on time complexity). In particular, having a somewhat low FDR can speed up the fit considerably.

In addition, the method works best, both in terms of quality of fit and time complexity, when the final model is known to be sparse, with the number of non-zero coefficients being of order O(mk) with m << N and m << kq (where q is the number of lags and N is the number of time points). It is not suitable for dense or even semi-sparse models. As the final coefficient count increases, the time complexity of the algorithm grows rapidly; see the supporting information of [5] under the section on time complexity for a discussion.


Quantities derived from VAR model

In order to find the Granger causalities (in both the time and frequency domains) from the fitted model, we first need to compute the following quantities using the fitted model:

1. ACVF: Matrix of variance and cross-covariance at any given lag.

2. Sub-Model Coefficients: Given a subset of variables and a subset of lags, compute a VAR model as if all the "data" is to be explained by only the specified subset of variables and the specified subset of the original set of lags.

3. Transfer Function: The transfer function of the operator associated with the VAR model.

4. Spectrum: Matrix of power and cross spectra at any given frequency.

To compute the Granger causalities we use these quantities; under the consistency conditions discussed in section 3.5, the sum over all frequencies of the GC spectrum is identical to the GC strengths. In the following sections, we describe how to calculate these quantities and subsequently the Granger causalities.


ACVF, Transfer Function and Spectrum

Given the VAR model Main [2] of order p in k variables, define the extended (stacked) vector of lagged values:

X_t = \bigl[\,Y_t^T, Y_{t-1}^T, \ldots, Y_{t-(p-1)}^T\,\bigr]^T    (3.1.1)

This yields the extended model:

X_t = A\, X_{t-1} + E_t    (3.1.2)

where the extended vector X_t is a column vector of length kp, E_t = [\epsilon_t^T, 0, \ldots, 0]^T, and the extended coefficient matrix A is the (kp × kp) block companion matrix whose first block row is [A_1, A_2, ..., A_p] and whose remaining block rows contain identity blocks on the sub-diagonal. The extended model is not convenient for direct computation. However, it can be used to conveniently derive more efficient formulas to compute the above mentioned quantities. We do this below.


Autocovariance Function

The extended autocovariance can be found by multiplying both sides of the equation first by X_t^T and then by X_{t-1}^T and taking expectations on both sides. The resulting two equations are:

G = A\,\tilde G + \tilde\Sigma, \qquad \tilde G = G\, A^T
which can also be written as a single equation:
G = A\, G\, A^T + \tilde\Sigma    (3.1.3)
The matrix \tilde\Sigma is zero except for the top-left (k × k) block, which is the same as the noise covariance of the original model. In terms of the autocovariances of the original model, the matrices G and \tilde G are, respectively:

G = \begin{pmatrix} G_0 & G_1 & \cdots & G_{p-1} \\ G_{-1} & G_0 & \cdots & G_{p-2} \\ \vdots & & \ddots & \vdots \\ G_{-p+1} & G_{-p+2} & \cdots & G_0 \end{pmatrix}, \qquad \tilde G = \begin{pmatrix} G_{-1} & G_0 & \cdots & G_{p-2} \\ G_{-2} & G_{-1} & \cdots & G_{p-3} \\ \vdots & & \ddots & \vdots \\ G_{-p} & G_{-p+1} & \cdots & G_{-1} \end{pmatrix}
Here G_j represents the autocovariance at lag j of the original variable Y_t. Note that G_{-m} = G_m^T. Substituting these into the equations for the extended autocovariance and performing the block multiplications, we get, after some matrix algebra, the following p equations:

G_j = \sum_{i=1}^{j} A_i G_{j-i} + \sum_{i=j+1}^{p} A_i G_{i-j}^T, \qquad j = 1, 2, \ldots, p-1    (3.1.4)

G_0 = \sum_{i=1}^{p-1} A_i G_i^T + A_p \sum_{i=1}^{p} \{A_i G_{p-i}\}^T + \Sigma    (3.1.5)
For a description of efficient ways of solving (3.1.3) or (3.1.4), see section 5.

Transfer Function

The transfer function H_p(ω) is calculated in a straightforward way from the formula:

L_p(\omega) = I - \alpha(\omega) A_1 - \alpha^2(\omega) A_2 - \ldots - \alpha^p(\omega) A_p    (3.1.6)

H_p(\omega) = L_p(\omega)^{-1}

Spectrum

Taking Fourier transforms of both sides of (3.1.2) and averaging, we obtain the spectrum of Y as the top-left (k × k) sub-block of the matrix function:

S_X(\omega) = -G + H(\omega)\, G + G\, H^*(\omega)    (3.1.9)

where A is the extended coefficient matrix of (3.1.2), G is the covariance matrix (autocovariance at lag 0) of the extended vector X, and H(ω) is the corresponding transfer function of the extended model (3.1.2). To find the top-left (k × k) sub-block in (3.1.9), we solve the resulting block equations, which gives:

S(\omega) = -G_0 + T(\omega) + T^*(\omega)    (3.1.10)

with

T(\omega) = H_p(\omega)\,\Bigl[\sum_{j} (L_j(\omega) - L_p(\omega))\, G_j\Bigr]    (3.1.11)

where G_j is the autocovariance of Y at lag j, and L_j(ω) is the expression (3.1.6) whose inverse is the transfer function H_j(ω) for an order j model.
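As an illustration, the following sketch (our own, not the paper's implementation) evaluates H_p(ω) and the spectral matrix using the direct factorization S(ω) = H_p(ω) Σ H_p*(ω), an equivalent route to the block-reduced form above for a stable model; it assumes α(ω) = e^{−iω}, and the function name is ours.

```python
import numpy as np

def transfer_and_spectrum(A_list, Sigma, omega):
    """Transfer function H_p(w) = L_p(w)^{-1} of (3.1.6) and the VAR spectral
    matrix at angular frequency w, via S(w) = H_p(w) Sigma H_p(w)^*."""
    k = Sigma.shape[0]
    L = np.eye(k, dtype=complex)
    for i, Ai in enumerate(A_list, start=1):
        L -= np.exp(-1j * omega * i) * Ai      # alpha(w)^i * A_i with alpha = e^{-iw}
    H = np.linalg.inv(L)
    S = H @ Sigma @ H.conj().T
    return H, S
```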


Sub-models

Given the VAR model for the full set of variables W and some given set of lags L = {1, 2, ..., p}, where p is the order of the model, we now describe how to compute the VAR submodel for a subset of the variables and a subset of the lags. More precisely, we would like to compute the VAR model as if all the "data" is to be explained by only the specified subset of the original set of variables and the specified subset of the original lags. Let X denote the subset of variables for which we require the sub-model. Let Z denote the complement of X in W. Without loss of generality we may assume that for some s < k,

X = \{w^{(1)}, w^{(2)}, \ldots, w^{(s)}\}    (3.2.2)

Z = \{w^{(s+1)}, \ldots, w^{(k)}\}    (3.2.3)

Writing the model equations for the components of X in block form, let B^{xx}_i and B^{xy}_i denote the blocks of the coefficient matrices pertaining to the lagged X and Z values in the equations for X, and let µ^x_t be the noise pertaining only to X. Also denote by J the subset of the original lags (L) over which the fit is required, and let M denote the complement of J in L. We can then write the equations for X as:
X_t - \sum_{i \in J} B^{xx}_i X_{t-i} = \sum_{i \in M} B^{xx}_i X_{t-i} + \sum_{i \in L} B^{xy}_i Z_{t-i} + \mu^x_t    (3.2.7)
and for the purpose of fitting the submodel, we write:
X_t - \sum_{i \in J} C_i X_{t-i} = \nu_t    (3.2.8)
The right hand side of (3.2.8) is the 'noise' which is, by hypothesis, uncorrelated with X_{t-i} for i ≥ 1. We therefore have to "fit" the coefficients C_i to the model represented by (3.2.8). The autocovariance function (which can be thought of as the autocovariance of the "data") can be calculated directly from the coefficients of the original model. We can therefore use the Yule-Walker equations for the original model (restricted to the components in X) to fit the coefficients C_i in (3.2.8) and then compute the new noise term

ν_t. From equations (3.2.7) and (3.2.8), this noise term is:

\nu_t = \sum_{i \in J} (B^{xx}_i - C_i) X_{t-i} + \sum_{i \in M} B^{xx}_i X_{t-i} + \sum_{i \in L} B^{xy}_i Z_{t-i} + \mu^x_t    (3.2.9)

= \sum_{i \in L} (B^{xx}_i - C_i 1^J_i) X_{t-i} + \sum_{i \in L} B^{xy}_i Z_{t-i} + \mu^x_t    (3.2.10)

= \sum_{i \in L} F^x_i W_{t-i} + \mu^x_t    (3.2.11)

where F^x_i = [\,B^{xx}_i - C_i 1^J_i,\; B^{xy}_i\,] and 1^J is the indicator function of J on L.

As per the original model hypothesis, the W_{t-i}, for i ≥ 1, are not correlated with µ^x_t. Using this fact, and after some algebraic simplification, the covariance matrix of ν is:

\Sigma_\nu = F\, G\, F^T + \Sigma_{\mu^x}    (3.2.12)

where F = [F^x_1, F^x_2, \ldots, F^x_p] and G is the block Toeplitz matrix of lagged autocovariances of the full set of variables:

G = \begin{pmatrix} G_0 & G_1 & \cdots & G_{p-1} \\ G_{-1} & G_0 & \cdots & G_{p-2} \\ \vdots & & \ddots & \vdots \\ G_{-p+1} & G_{-p+2} & \cdots & G_0 \end{pmatrix}

where G_h is the autocovariance of the original full set of variables W at lag h and \Sigma_{\mu^x} is the covariance matrix of µ^{(x)}.

This allows us to calculate the noise covariance of the sub-model directly: it involves no further calculation other than solving the Yule-Walker equations restricted to the subset X, and computing the new noise term via (3.2.12). Apart from the solution of the Yule-Walker equations, a one time computation of the autocovariances of the full model as per (3.1.4) suffices for all submodels. We discuss an efficient implementation of the subfit computations above in section 5.
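A sketch of the sub-fit computation under the stated assumptions (function name and data layout are ours). Rather than assembling (3.2.12) explicitly, the sketch uses the algebraically equivalent expression Σ_ν = G_x(0) − Σ_{i∈J} C_i G_x(i)^T, which follows from the restricted Yule-Walker equations once the C_i have been fitted.

```python
import numpy as np

def submodel_fit(G_full, x_idx, J):
    """Sub-model of (3.2.8): given autocovariances G_full[h] (k x k, h = 0..p)
    of the full model, fit coefficients C_i, i in J, for the variable subset
    x_idx via the Yule-Walker equations restricted to those components and
    lags, and return the sub-model noise covariance Sigma_nu."""
    x = np.asarray(x_idx)
    Gx = lambda h: (G_full[h][np.ix_(x, x)] if h >= 0
                    else G_full[-h][np.ix_(x, x)].T)        # G_{-h} = G_h^T
    J = list(J)
    kx = len(x)
    # Restricted Yule-Walker: for j in J,  G_x(j) = sum_{i in J} C_i G_x(j-i)
    lhs = np.block([[Gx(i - j) for i in J] for j in J])
    rhs = np.vstack([Gx(j).T for j in J])
    Ct = np.linalg.solve(lhs, rhs)                          # stacked C_i^T
    C = {i: Ct[n * kx:(n + 1) * kx, :].T for n, i in enumerate(J)}
    # Noise covariance of nu_t (equivalent here to (3.2.12)):
    Sigma_nu = Gx(0) - sum(C[i] @ Gx(i).T for i in J)
    return C, Sigma_nu
```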

Using the procedures outlined in sections 3.1 and 3.2, we are now ready to calculate the Granger Causality strengths and spectra as below. For more details on how to obtain the Granger Causality formulas, and for insight into the reasoning behind the definitions below, see [9], [10] and the references therein. Define the Lag operator L by L(X_t) = X_{t-1}. In the remainder of this section and the next we will find it convenient to write the VAR equations Main [2] in operator form:

[\,I - A(L)\,]\, Y_t = \epsilon_t    (3.3.1)

where A(L) = A_1 L^1 + A_2 L^2 + \ldots + A_p L^p.


Time Domain (GC strengths)

Let W be a process satisfying the same general conditions as in equation Main [2]. Suppose that W is composed of two sub-processes X and Y with W = [X, Y]. We can then represent W as:

X_t = \sum_{i=1}^{p} B^{xx}_i X_{t-i} + \sum_{i=1}^{p} B^{xy}_i Y_{t-i} + \epsilon^x_t

Y_t = \sum_{i=1}^{p} B^{yx}_i X_{t-i} + \sum_{i=1}^{p} B^{yy}_i Y_{t-i} + \epsilon^y_t
We write these in operator form as in 3.3.1:
\left(\begin{pmatrix} I_x & 0 \\ 0 & I_y \end{pmatrix} - \begin{pmatrix} B_{xx}(L) & B_{xy}(L) \\ B_{yx}(L) & B_{yy}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} \epsilon^x_t \\ \epsilon^y_t \end{pmatrix}    (3.3.2)

with the error covariance matrix for ε given by:

T = \begin{pmatrix} T_{xx} & T_{xy} \\ T_{yx} & T_{yy} \end{pmatrix}    (3.3.3)
In addition, assume that the two sub-processes, X and Y , individually satisfy the same general assumptions as in Main [2] so that we can write them as:
\left(\begin{pmatrix} I_x & 0 \\ 0 & I_y \end{pmatrix} - \begin{pmatrix} A_x(L) & 0 \\ 0 & A_y(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} \delta^x_t \\ \delta^y_t \end{pmatrix}

with the error covariance matrix for δ given by:

\Sigma = \begin{pmatrix} \Sigma_x & 0 \\ 0 & \Sigma_y \end{pmatrix}
In the time domain, the Granger Causality F Y →X from Y to X is defined by:
F_{Y \to X} = \ln \frac{|\Sigma_x|}{|T_{xx}|}    (3.3.4)
where |.| denotes the determinant.The Granger Causality from X to Y is defined analogously.Having fitted a VAR model to the full set of data, the GC strengths can be computed using the above formulas and the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.
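Given the two residual covariances (for example, obtained from the sub-fit sketch above and from the full model), (3.3.4) is a one-line computation; the function name is ours:

```python
import numpy as np

def gc_strength(Sigma_x, T_xx):
    """Time-domain Granger Causality F_{Y->X} of (3.3.4): log ratio of the
    determinant of the X residual covariance in the reduced model (X alone)
    to that of the X block in the full [X, Y] model."""
    return np.linalg.slogdet(Sigma_x)[1] - np.linalg.slogdet(T_xx)[1]
```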


Frequency Domain (GC spectrum)

First left multiply both sides of equation (3.3.2) by the normalizing matrix:
\begin{pmatrix} I_x & 0 \\ -T_{yx} T_{xx}^{-1} & I_y \end{pmatrix}    (3.3.5)
The resulting equations:
\left(\begin{pmatrix} I_x & 0 \\ \tilde B_{yx}(0) & I_y \end{pmatrix} - \begin{pmatrix} \tilde B_{xx}(L) & \tilde B_{xy}(L) \\ \tilde B_{yx}(L) & \tilde B_{yy}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} \tilde\epsilon^x_t \\ \tilde\epsilon^y_t \end{pmatrix}    (3.3.6)
include a projection of Y onto not only the past, but also the present, of X, in such a way that the cross covariance between the X and Y blocks of the new noise terms vanishes:

Cov(\tilde\epsilon) = \begin{pmatrix} T_{xx} & 0 \\ 0 & \tilde T_{yy} \end{pmatrix}    (3.3.7)
Because of this independence, the spectrum S^{(xx)}(ω) of X can be decomposed as:

S^{(xx)}(\omega) = \tilde H_{xx}(\omega)\, T_{xx}\, \tilde H_{xx}^*(\omega) + \tilde H_{xy}(\omega)\, \tilde T_{yy}\, \tilde H_{xy}^*(\omega)    (3.3.8)

where \tilde H(\omega) is the transfer function of the normalized equations (3.3.6). The Granger Causality spectrum from Y to X is then defined as:

f_{Y \to X}(\omega) = \ln \frac{|S^{(xx)}(\omega)|}{|\tilde H_{xx}(\omega)\, T_{xx}\, \tilde H_{xx}^*(\omega)|}    (3.3.9)
As in the case of the GC strengths, once a VAR model is fit to the full set of data, the GC spectrum can be computed using the above formulas and the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models. In a multivariable system, we can take pairs of time series components and compute the GC for each pair. This pairwise analysis is useful, but it has inherent limitations. We do not know if Y is genuinely driven by X or only appears to be driven by X because of some intermediary Z which transmits the influence to Y. To determine whether a causal influence from variable X to Y is partially or wholly mediated by some other variable Z, we need the concept of conditional Granger Causality, both in the time and in the spectral domains. We briefly mention the relevant equations here. For more details on how to obtain these formulas, see [8] or [9, pg. 228-237] and the references therein. For insight into the reasoning and rationale behind the definitions below, see [10].


Time Domain (conditional GC strengths)

Consider three stochastic processes X_t, Y_t and Z_t. Suppose that a pairwise analysis shows Granger Causality from Y to X. To determine whether this is partially or wholly due to the presence of Z, first write the VAR model equations for the processes X_t and Z_t taken together:
\left(\begin{pmatrix} I_x & 0 \\ 0 & I_z \end{pmatrix} - \begin{pmatrix} B_{xx}(L) & B_{xz}(L) \\ B_{zx}(L) & B_{zz}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Z_t \end{pmatrix} = \begin{pmatrix} \epsilon^x_t \\ \epsilon^z_t \end{pmatrix}    (3.4.1)
Denote the covariance matrix of the noise ε by:

T = \begin{pmatrix} T_{xx} & T_{xz} \\ T_{zx} & T_{zz} \end{pmatrix}    (3.4.2)

Next, write the VAR model equations for X, Y and Z taken together:

\left(\begin{pmatrix} I_x & 0 & 0 \\ 0 & I_y & 0 \\ 0 & 0 & I_z \end{pmatrix} - \begin{pmatrix} C_{xx}(L) & C_{xy}(L) & C_{xz}(L) \\ C_{yx}(L) & C_{yy}(L) & C_{yz}(L) \\ C_{zx}(L) & C_{zy}(L) & C_{zz}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Y_t \\ Z_t \end{pmatrix} = \begin{pmatrix} \phi^x_t \\ \phi^y_t \\ \phi^z_t \end{pmatrix}    (3.4.3)

and denote the corresponding error covariance terms by:

\Upsilon = \begin{pmatrix} \Upsilon_{xx} & \Upsilon_{xy} & \Upsilon_{xz} \\ \Upsilon_{yx} & \Upsilon_{yy} & \Upsilon_{yz} \\ \Upsilon_{zx} & \Upsilon_{zy} & \Upsilon_{zz} \end{pmatrix}    (3.4.4)

where Υ_yx = Υ_xy^T, Υ_zx = Υ_xz^T and Υ_zy = Υ_yz^T.

Then the Granger Causality from Y to X, conditional on Z, is defined as:

F_{Y \to X | Z} = \ln \frac{|T_{xx}|}{|\Upsilon_{xx}|}    (3.4.5)

As in the pairwise case, once a VAR model has been fit to the full set of data, the conditional GC strengths can be computed using the above formulas and the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.
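As with the pairwise case, the time-domain value is a log-determinant ratio; a one-line helper (name ours), following the definition above:

```python
import numpy as np

def conditional_gc_strength(T_xx, Ups_xx):
    """Time-domain conditional GC F_{Y->X|Z}: T_xx is the X-block residual
    covariance of the [X, Z] model (3.4.1) and Ups_xx that of the full
    [X, Y, Z] model (3.4.3)."""
    return np.linalg.slogdet(T_xx)[1] - np.linalg.slogdet(Ups_xx)[1]
```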


Frequency Domain (conditional GC spectrum)

The computation of the conditional GC spectrum parallels the pairwise case. Below, I_x, I_y and I_z denote identity matrices of the appropriate block size.

As in the case of the pairwise GC spectrum, first multiply both sides of equation (3.4.1) by the normalizing matrix:

P_0 = \begin{pmatrix} I_x & 0 \\ -T_{zx} T_{xx}^{-1} & I_z \end{pmatrix}
This projects the variable Z onto the past as well as the present of X in a way that absorbs the cross covariance between the noise terms of X and Z. Therefore, the resulting equations have independent noise covariance:

\left(\begin{pmatrix} I_x & 0 \\ \tilde B_{zx}(0) & I_z \end{pmatrix} - \begin{pmatrix} \tilde B_{xx}(L) & \tilde B_{xz}(L) \\ \tilde B_{zx}(L) & \tilde B_{zz}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Z_t \end{pmatrix} = \begin{pmatrix} \tilde\epsilon^x_t \\ \tilde\epsilon^z_t \end{pmatrix}    (3.4.6)

Cov(\tilde\epsilon) = \begin{pmatrix} T_{xx} & 0 \\ 0 & \tilde T_{zz} \end{pmatrix}    (3.4.7)
Similarly, multiply both sides of equation (3.4.3) by P [= P_2 P_1], where:

P_1 = \begin{pmatrix} I_x & 0 & 0 \\ -\Upsilon_{yx}\Upsilon_{xx}^{-1} & I_y & 0 \\ -\Upsilon_{zx}\Upsilon_{xx}^{-1} & 0 & I_z \end{pmatrix}, \qquad P_2 = \begin{pmatrix} I_x & 0 & 0 \\ 0 & I_y & 0 \\ 0 & -(\Upsilon_{zy} - \Upsilon_{zx}\Upsilon_{xx}^{-1}\Upsilon_{xy})(\Upsilon_{yy} - \Upsilon_{yx}\Upsilon_{xx}^{-1}\Upsilon_{xy})^{-1} & I_z \end{pmatrix}
Again, this projects the variable Y onto the past and present of X, and the variable Z onto the past and present of both X and Y , in such a way that the resulting equations have independent noise covariance:
\left(\begin{pmatrix} I_x & 0 & 0 \\ \tilde C_{yx}(0) & I_y & 0 \\ \tilde C_{zx}(0) & \tilde C_{zy}(0) & I_z \end{pmatrix} - \begin{pmatrix} \tilde C_{xx}(L) & \tilde C_{xy}(L) & \tilde C_{xz}(L) \\ \tilde C_{yx}(L) & \tilde C_{yy}(L) & \tilde C_{yz}(L) \\ \tilde C_{zx}(L) & \tilde C_{zy}(L) & \tilde C_{zz}(L) \end{pmatrix}\right) \begin{pmatrix} X_t \\ Y_t \\ Z_t \end{pmatrix} = \begin{pmatrix} \tilde\phi^x_t \\ \tilde\phi^y_t \\ \tilde\phi^z_t \end{pmatrix}    (3.4.8)

Cov(\tilde\phi) = \begin{pmatrix} \Upsilon_{xx} & 0 & 0 \\ 0 & \tilde\Upsilon_{yy} & 0 \\ 0 & 0 & \tilde\Upsilon_{zz} \end{pmatrix}    (3.4.9)
Now let \tilde H_b(\omega) and \tilde H_c(\omega), respectively, denote the transfer functions, as defined in section 3, of the operators on the left hand sides of the equations (3.4.6) and (3.4.8). Then define the matrix Q(ω) by:

\begin{pmatrix} Q_{xx} & Q_{xy} & Q_{xz} \\ Q_{yx} & Q_{yy} & Q_{yz} \\ Q_{zx} & Q_{zy} & Q_{zz} \end{pmatrix}(\omega) = \begin{pmatrix} \tilde H_{b,xx}(\omega) & 0 & \tilde H_{b,xz}(\omega) \\ 0 & I_y & 0 \\ \tilde H_{b,zx}(\omega) & 0 & \tilde H_{b,zz}(\omega) \end{pmatrix}^{-1} \tilde H_c(\omega)    (3.4.10)

The Conditional Granger Causality Spectrum from Y to X conditional on Z is then defined as:

f_{Y \to X | Z}(\omega) = \ln \frac{|T_{xx}|}{|Q_{xx}(\omega)\, \Upsilon_{xx}\, Q_{xx}^*(\omega)|}    (3.4.11)
For more details and insight into these formulas, see [10]. The Conditional Granger Causality Spectrum from X to Y conditional on Z is defined analogously. As in the case of the pairwise GC spectrum, once a VAR model is fit to the full set of data, the conditional GC spectrum can be computed using the above formulas and the formulas in section 3.2 for the coefficients and noise covariance of the sub-fitted models.


Consistency between GC strengths and spectrum

For the pairwise Granger causalities and spectrum, the consistency condition which ensures the equality of f_{Y→X} and \sum_\omega f_{Y \to X}(\omega) can be stated as follows:

The pairwise GC spectrum is consistent with the GC strengths only if the VAR operator \tilde B_{yy} in equations (3.3.6) is stable. If \tilde B_{yy} is not stable, then \sum_\omega f_{Y \to X}(\omega) < f_{Y \to X}.
For conditional Granger Causality, the corresponding condition is:

Let L^{yz}_Q be the YZ block of the VAR operator whose transfer function is the left hand side Q of equation (3.4.10). Then the conditional GC spectrum is consistent with the conditional GC strengths only if L^{yz}_Q is stable; otherwise \sum_\omega f_{Y \to X|Z}(\omega) < f_{Y \to X|Z}.
For the Granger causalities in the opposite directions, reverse the roles of X and Y everywhere. For proofs of these statements, see [7] and [10] respectively.

Note that to compute all pairs of Granger causalities in a multivariable model, one would use the sub-fitting procedure described in section 3.2 on each pair of variables (or each pair of blocks, as the case may be) and then use the formulas for the Granger causalities above. For the conditional Granger causalities, to get f_{Y→X|Z}, subfits must be performed for the YZ and XY blocks. Thus the GC consistency condition must be checked separately for each pair of interacting blocks.


GC consistency condition for random models

Our numerical tests (on pairwise GC) show that random dense models are unlikely to satisfy the GC consistency condition for all interactions. On generating a large number of random dense models, we found that the probability that a random dense model satisfies the consistency condition decreases as the model size gets larger, and in general this probability is a very small but finite value. On the other hand, similar tests on sparse models show that random sparse models (of the sparsity levels that concern us here) satisfy the consistency condition with high probability.

The practical implication of this is that for sufficiently dense VAR models, there will be large discrepancies between the integrated GC spectrum and the GC strengths for those interactions that do not satisfy the consistency condition. The discrepancies become particularly egregious as the coefficient density becomes high.


Additional performance measures

We include a few more performance measures here in addition to the ones in section Main [3].

The reference models against which the tests were made were all checked for the pairwise consistency condition stated in section 3.5 and they were, as expected, all GC-consistent with respect to both Pairwise and Conditional Granger Causality. As noted above, sparse models have a high probability of passing the GC consistency test, while dense models have a fairly high probability of failing it for at least some interactions.

The models obtained via SVARGS+CGC were all stable (or required a very small deflation (< 1%) of the coefficient size to make them stable) and all satisfied the GC consistency condition for Pairwise GC. We did not check the theoretical consistency condition for the conditional GC, because it would have involved approximating a theoretically infinite order VAR operator from its transfer function. However, the normalized difference between the Conditional GC strengths and the sum of the Conditional GC spectrum was less than 0.05 in all cases, which strongly suggests that they satisfy the Conditional GC consistency condition.


Error Measure Definitions

Percentage of unpredicted coefficients = 100 \times |L_r \cap L_f^c(M)| \,/\, |L_r|

Normalized RMS error in predicted coefficients = \left(M^{-1} \sum_{i \in L_r} \bigl((r_i - f_i)/r_i\bigr)^2\right)^{1/2}

Normalized bias in predicted coefficients = M^{-1} \sum_{i \in L_r} (r_i - f_i)/r_i

Normalized RMS error in noise covariance = \|\Sigma_r - \Sigma_f\|_s \,/\, \|\Sigma_r\|_s

Here r_i and f_i denote the reference and fitted coefficient values at location i, L_r and L_f the sets of non-zero coefficient locations in the reference and fitted models, and Σ_r and Σ_f the reference and fitted noise covariances.
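An illustrative implementation of these measures (names are ours; the norm \|\cdot\|_s is taken here to be the spectral norm, which is an assumption):

```python
import numpy as np

def error_measures(A_ref, A_fit, Sigma_ref, Sigma_fit, tol=0.0):
    """Performance measures listed above, for reference and fitted coefficient
    arrays of identical shape (e.g. stacked A_1..A_p)."""
    Lr = np.abs(A_ref) > tol                      # reference non-zero locations
    M = Lr.sum()
    missed = Lr & (np.abs(A_fit) <= tol)          # reference coefficients not predicted
    rel = (A_ref[Lr] - A_fit[Lr]) / A_ref[Lr]
    return {
        "pct_unpredicted": 100.0 * missed.sum() / M,
        "rms_coef_error": np.sqrt(np.mean(rel ** 2)),
        "coef_bias": np.mean(rel),
        "noise_cov_error": (np.linalg.norm(Sigma_ref - Sigma_fit, 2)
                            / np.linalg.norm(Sigma_ref, 2)),
    }
```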

Performance with Added Noise

We also computed VAR models from simulated data to which simulated external (measurement) noise had been added. The standard deviations of the simulated noise used were 0, 0.2, 0.4 and 0.6 of the standard deviation of the data. The methods were evaluated on 10 randomly generated VAR models. Figure 6 shows the performance of SVARGS in the presence of added (measurement) noise.


Efficient computation


Autocovariance computation

There is no convenient closed form expression for the autocovariances G_j in (3.1.4). For smaller systems (< 200 variables or so) we can efficiently compute the G_j by setting up the coefficients in the matrix equations and determining, programmatically, the matrix corresponding to the linear equation to be solved. This is much faster than trying to directly use the equations for the extended autocovariance. However, for larger systems, this becomes inefficient. We then use an iterative solver to solve the extended autocovariance equations (3.1.3), which are in the form of Discrete Lyapunov Equations. The Matlab dlyap method can be used to solve these.

Another faster option is to use a power doubling method described in [11] modified by us to take into account the special structure of the matrix A. This is particularly suitable for sparse models and is much faster than dlyap.Both these methods were used successfully in our applications.
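The sketch below (our own illustration) builds the companion form of section 3.1 and solves (3.1.3) either with SciPy's discrete Lyapunov solver or with a simple doubling iteration in the spirit of the power-doubling method mentioned above (the actual modified method of [11] may differ); both assume a stable model.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def extended_autocovariance(A_list, Sigma):
    """Solve (3.1.3) for the extended autocovariance G of the companion form
    (3.1.2) and return the lag blocks G_0 .. G_{p-1} of the original model."""
    p = len(A_list)
    k = Sigma.shape[0]
    A = np.zeros((k * p, k * p))                 # block companion matrix
    A[:k, :] = np.hstack(A_list)
    A[k:, :-k] = np.eye(k * (p - 1))
    Sig_ext = np.zeros((k * p, k * p))           # only top-left block non-zero
    Sig_ext[:k, :k] = Sigma
    G = solve_discrete_lyapunov(A, Sig_ext)      # G = A G A^T + Sig_ext
    return [G[:k, j * k:(j + 1) * k] for j in range(p)]

def autocov_by_doubling(A, Sig_ext, iters=30):
    """Doubling iteration for the same Lyapunov equation: after n steps the
    partial sum contains 2^n terms of sum_j A^j Sig_ext (A^T)^j."""
    G, Ak = Sig_ext.copy(), A.copy()
    for _ in range(iters):
        G = G + Ak @ G @ Ak.T
        Ak = Ak @ Ak
    return G
```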

Another approach that works well for dense models is to diagonalize the block companion matrix A in the expression for the extended autocovariance in section 3.1.1, as A = PΛP^{-1}, where Λ is a diagonal matrix with terms l_i on the diagonal and P is the matrix of eigenvectors of A.

The equation (3.1.3)for the extended autocovariance then becomes:
G' = \Lambda^T G' \Lambda + \Sigma'    (5.1.1)

where G' = P^{-1} G P^{-T} and \Sigma' = P^{-1} \Sigma P^{-T}. The matrix A must be non-defective up to numerical tolerance for this to work properly.
The equation (5.1.1) has the explicit solution:

G = P\, T\, P^T    (5.1.2)

where T_{ij} = \Sigma'_{ij} / (1 - l_i\, l_j).
This works better for dense rather than sparse matrices, since a dense A is much more likely to be non-defective. For defective A, a similar explicit solution can be found in terms of the Jordan canonical form, but this is impractical to compute for all but the smallest matrices.
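A sketch of the diagonalization route (5.1.1)-(5.1.2) (function name ours; it assumes A is non-defective and stable, so that no product l_i l_j equals 1):

```python
import numpy as np

def autocov_by_diagonalization(A, Sig_ext):
    """Explicit solution (5.1.2) of G = A G A^T + Sig_ext via the eigen-
    decomposition A = P Lambda P^{-1}."""
    lam, P = np.linalg.eig(A)
    Pinv = np.linalg.inv(P)
    Sig_p = Pinv @ Sig_ext @ Pinv.T            # Sigma' = P^{-1} Sigma P^{-T}
    T = Sig_p / (1.0 - np.outer(lam, lam))     # T_ij = Sigma'_ij / (1 - l_i l_j)
    return (P @ T @ P.T).real                  # G = P T P^T (imaginary part ~ 0)
```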

Note that the iterative methods for computing the autocovariance require that the VAR model be stable.


Conditional GC computation

As remarked in section 3.2, when computing conditional Granger causalities, a "leave one block out" subfit needs to be performed for each of the blocks where the conditional Granger Causality is required. We compute the conditional Granger Causality between two variables (or two, typically small, blocks), conditional on all the remaining variables. Generally, this is the situation where, for each subfit, the subset of variables X to be subfit contains most of the variables in the system W (meaning, in each case, the subset of variables Z to be left out is small in relation to W). In such cases the subfit can be computed quite efficiently, without having to invert a large autocovariance matrix each time to find C and without having to perform the large matrix multiplication given by equation (3.2.12). This is done by computing the differentials in the noise and the coefficients rather than the actual quantities themselves. We state this result in a proposition (refer to section 3.2 for notation):

Proposition 1 (Noise delta and Coefficient delta for VAR model subfit).

Let G be as in equation (3.1.3). Denote the inverse of G by Q. Let X (with k_x variables) and Z (with k_z variables) be a partition of the variables into two blocks as in equations (3.2.2). Let G_x be the submatrix of G that has the lag-extended autocovariance terms of X, and let g be the submatrix of G that has the lag-extended cross-covariance between X and Z, so that:

G = \begin{pmatrix} G_x & g \\ g^T & g_z \end{pmatrix}, \qquad Q = \begin{pmatrix} Q_x & q \\ q^T & q_z \end{pmatrix}    (5.2.1)

Let B^{xx} = vertcat(B^{xx}_i), let B^{zx} = vertcat(B^{zx}_i), and let µ^x be the noise in the equations for the X variables as in equation (3.2.5). Let C = vertcat(C_i) be the subfitted coefficients for X, and let ν be the sub-fitted noise as in equation (3.2.8). Then the coefficient delta and noise delta are given, respectively, by:

C - B^{xx} = -(q\, q_z^{-1} B^{zx})^T    (5.2.2)

\Sigma_\nu - \Sigma_{\mu^x} = B^{zx\,T} q_z^{-1} B^{zx}    (5.2.3)
The proof of (5.2.2) involves writing the equations for the coefficients in block form and is straightforward but tedious, so we omit it here.
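An illustrative evaluation of (5.2.2)-(5.2.3) (function name and index-array interface are ours; the stacking/transpose convention follows the proposition as stated and may need adjusting to a particular coefficient layout):

```python
import numpy as np

def subfit_deltas(Q, x_rows, z_rows, B_zx):
    """Coefficient and noise deltas of Proposition 1: Q is the inverse of the
    lag-extended autocovariance matrix G of (3.1.3), x_rows / z_rows index its
    rows belonging to the X and Z blocks, and B_zx = vertcat(B^{zx}_i) comes
    from the original model."""
    q = Q[np.ix_(x_rows, z_rows)]              # (X, Z) block of Q
    q_z = Q[np.ix_(z_rows, z_rows)]            # (Z, Z) block of Q
    qz_inv_Bzx = np.linalg.solve(q_z, B_zx)    # q_z^{-1} B_zx
    coef_delta = -(q @ qz_inv_Bzx).T           # C - B_xx, eq. (5.2.2)
    noise_delta = B_zx.T @ qz_inv_Bzx          # Sigma_nu - Sigma_mu_x, eq. (5.2.3)
    return coef_delta, noise_delta
```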


Fast computation of conditional GC due to known sparse locations

The sizes of the variable blocks between which the conditional GC is to be computed are usually small. So the number of rows in B^{zx} and the number of columns in q are small (for atomic blocks, it is exactly the number of lags). Similarly, q_z and g_z are just small square matrices. In addition, equation (5.2.2) also tells us that the conditional GC from block i to block j is only non-zero if, in the original equations, the coefficient block (j, i) is non-zero at some lag. This further greatly reduces the amount of work required. The only expensive computation is the one-time calculation of the inverse Q of G. Therefore, computing the coefficient and noise deltas using (5.2.2) is much faster than using the full equations in section 3.2.

If, on the other hand, the variable blocks are large, then the total number of GC values to be calculated is small (since the number of GC values is the square of the number of blocks), and so the cost of computing the new coefficients by inverting the covariance matrix corresponding to the complement of Z is not so relevant.

In our implementation, we use equations (5.2.2) if the size of the variable block is less than k/2. Otherwise we directly use the equations in section 3.2. Here k is the total number of variables.

Note that computing the subfits requires the computation of the autocovariance. As we remarked in the previous section, if an iterative method is used to compute the autocovariance, then the model must be stable.


SVARGS computation

A fast implementation of the main computations involved in the SVARGS method is described in the supplementary information of [5], under the section on efficient implementation of the algorithm. The same implementation can be easily extended to variable blocks as long as the blocks are all of the same size.


7 Measure and node (channel) preselection procedure

To reduce the number of measures/nodes, start with the full set of features F: this is the union of the features over all the feature blocks (each feature block corresponds to one measure or one node). Set the initial loss L_0 = 1. We use the iterative procedure briefly described below:

1. Randomly partition the dataset into an appropriate number K of folds.

2. Then, for each test fold, train a model using the features in F on the corresponding training fold using Adaboost-SAMME up to the required number of rounds.

3. Use the model obtained to classify the test fold and record the predictions. This step will involve calculating all the individual gains at every split on the test fold.

4. Since the individual gains are now calculated for each measure/node, average the gain of the corresponding feature block over all rounds.

5. Perform multiple runs of the Adaboost-SAMME algorithm by repeating steps 1 to 4 for different K-fold partitions of the data, average the resulting gains, and compute the aggregate loss L_F from the predictions over all the runs.

6. If L_F > L_0, STOP.

7. Prune the feature blocks by clustering the set of gains into high gains and low gains. The heuristic used for doing this was as follows (a code sketch of this clustering step is given after the list):

(a) Sort the gain values in decreasing order resulting in a curve G.

(b) Put the high outliers in G in the high gain cluster, the low outliers and zeros in the low gain cluster and remove all these from G.

(c) Find the index x_0 where the slope of G is minimum. Impose a lower bound x_0 = max(M/2, x_0) on x_0, where M is the number of elements in G.

(d) Add G(x), x < x 0 to the high gain cluster and put the rest in the low gain cluster.

8. Remove the feature blocks that have low gain. If no features have been removed, STOP.

9. Let F be the union of the features over the high gain feature blocks, let L 0 = L F and return to step 1.

The features selected are then the ones remaining in F.
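A sketch of the gain-clustering heuristic of step 7 (our own simplified version: the outlier and zero handling of steps 7(a)-(b) is omitted for brevity; the function name is ours):

```python
import numpy as np

def split_high_low_gains(gains):
    """Sort the gains in decreasing order, find the index of steepest drop
    (minimum slope), impose the lower bound of step 7(c), and split the
    feature blocks into high-gain and low-gain clusters."""
    order = np.argsort(gains)[::-1]
    G = np.asarray(gains)[order]               # decreasing gain curve
    if len(G) < 2:
        return order, order[:0]
    slopes = np.diff(G)                        # negative values; minimum = steepest drop
    x0 = int(np.argmin(slopes)) + 1
    x0 = max(len(G) // 2, x0)                  # lower bound x0 >= M/2, step 7(c)
    return order[:x0], order[x0:]              # indices of high- and low-gain blocks
```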

8 Application to DEAP and ADHD Datasets

2. The recording was passed to the PREP pipeline along with the boundary event markers described above. The PREP pipeline was set up to perform the following